What do Deep Networks Like to Hear?

This work is based on the paper "What do Deep Networks Like to See?" by Palacio et al., in which the authors analyse CNNs by fine-tuning an image autoencoder on the gradients of a fixed classifier. To do so, they constructed a pipeline in which the input image is first fed through the autoencoder, and the resulting reconstruction is then passed to the fixed classifier to obtain the final class predictions. The autoencoder weights are updated using the gradients of the prediction error, which is backpropagated through the frozen classifier and the reconstructed image into the autoencoder.
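The pipeline described above can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the setup used in this work or in the original paper: the two small fully connected networks, the input length of 256, the optimizer, the learning rate and the random data are hypothetical stand-ins chosen only to keep the example short. The point it shows is that the classifier is frozen, while the prediction error still flows back through it into the autoencoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sizes (assumptions): real inputs and models are far larger.
NUM_SAMPLES, NUM_CLASSES = 256, 50

# Hypothetical stand-in for the autoencoder: input -> reconstruction.
autoencoder = nn.Sequential(
    nn.Linear(NUM_SAMPLES, 32), nn.Tanh(), nn.Linear(32, NUM_SAMPLES)
)
# Hypothetical stand-in for the fixed classifier: input -> class logits.
classifier = nn.Sequential(
    nn.Linear(NUM_SAMPLES, 64), nn.ReLU(), nn.Linear(64, NUM_CLASSES)
)

# Freeze the classifier: its weights receive no gradients and never change.
classifier.eval()
for p in classifier.parameters():
    p.requires_grad_(False)

# Only the autoencoder parameters are handed to the optimizer.
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)


def finetune_step(inputs, labels):
    """One autoencoder update driven by the classifier's prediction error."""
    optimizer.zero_grad()
    reconstruction = autoencoder(inputs)      # input -> reconstruction
    logits = classifier(reconstruction)       # reconstruction -> predictions
    loss = F.cross_entropy(logits, labels)    # prediction error
    loss.backward()                           # flows through the frozen classifier
    optimizer.step()                          # updates the autoencoder only
    return loss.item()


# Random data, used only to exercise the loop.
inputs = torch.randn(8, NUM_SAMPLES)
labels = torch.randint(0, NUM_CLASSES, (8,))

classifier_before = [p.detach().clone() for p in classifier.parameters()]
autoencoder_before = [p.detach().clone() for p in autoencoder.parameters()]
losses = [finetune_step(inputs, labels) for _ in range(5)]
```

Because the optimizer never sees the classifier's parameters and their `requires_grad` flag is off, the classifier acts purely as a differentiable loss function for the autoencoder.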

This work extends this idea of classifier analysis to audio waveforms and acoustic scene classification. To this end, a pre-trained audio waveform autoencoder is fine-tuned to analyse three classifiers with different architectures on the ESC50 dataset. The autoencoder is taken from the ArchiSound GitHub repository, and the classifiers are from the EfficientAT and PaSST GitHub repositories.

The Dataset

For the experiments the ESC50 dataset was used. It consists of 2,000 environmental sound recordings. Each audio file is 5 seconds long and belongs to one of 50 classes. These classes can then be further clustered into 5 major categories:

| Animals | Natural soundscapes & water sounds | Human, non-speech sounds | Interior/domestic sounds | Exterior/urban noises |
|---|---|---|---|---|
| Dog | Rain | Crying baby | Door knock | Helicopter |
| Rooster | Sea waves | Sneezing | Mouse click | Chainsaw |
| Pig | Crackling fire | Clapping | Keyboard typing | Siren |
| Cow | Crickets | Breathing | Door, wood creaks | Car horn |
| Frog | Chirping birds | Coughing | Can opening | Engine |
| Cat | Water drops | Footsteps | Washing machine | Train |
| Hen | Wind | Laughing | Vacuum cleaner | Church bells |
| Insects (flying) | Pouring water | Brushing teeth | Clock alarm | Airplane |
| Sheep | Toilet flush | Snoring | Clock tick | Fireworks |
| Crow | Thunderstorm | Drinking, sipping | Glass breaking | Hand saw |

The original dataset uses a sample rate of 44.1 kHz, but this work uses the version provided by the PaSST repository, in which the audio files are resampled to 32 kHz. The dataset is organized into 5 folds: folds 2 to 5 are used for training and fold 1 is reserved for validation.
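The fold split and the resulting clip length can be sketched as follows. The sketch assumes that file names follow the pattern `{fold}-{clip_id}-{take}-{target}.wav` used by the original ESC-50 release; whether the resampled version keeps exactly these names is an assumption, and the helper functions and example file names are illustrative only.

```python
SAMPLE_RATE_HZ = 32_000   # resampled rate stated above
CLIP_SECONDS = 5          # clip length stated above
VALIDATION_FOLD = 1       # fold 1 is held out, folds 2 to 5 are used for training


def fold_of(filename: str) -> int:
    """Read the fold number from a name like '1-100032-A-0.wav' (assumed pattern)."""
    return int(filename.split("-")[0])


def split_by_fold(filenames):
    """Return (training files, validation files) according to the fold number."""
    train = [f for f in filenames if fold_of(f) != VALIDATION_FOLD]
    val = [f for f in filenames if fold_of(f) == VALIDATION_FOLD]
    return train, val


# Illustrative file names, used only to exercise the split.
example_files = ["1-100032-A-0.wav", "2-100648-A-43.wav", "5-9032-A-0.wav"]
train_files, val_files = split_by_fold(example_files)

# Each 5-second clip at 32 kHz holds 32,000 * 5 waveform samples.
samples_per_clip = SAMPLE_RATE_HZ * CLIP_SECONDS
print(train_files, val_files, samples_per_clip)
```

At 32 kHz a 5-second clip therefore contains 160,000 samples, compared with 220,500 at the original 44.1 kHz.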

Classifiers

In this work, three classifiers with different architectures are analysed:

  • MobileNet
  • Dynamic MobileNet
  • PaSST

MobileNet is an efficient CNN, used here to classify mel-spectrograms of audio waveforms. The Dynamic MobileNet extends the plain architecture with dynamic components that enable attention within the model. PaSST is a transformer-based sound classifier that also operates on mel-spectrograms. Both the MobileNet and the Dynamic MobileNet were obtained through knowledge distillation from the PaSST model and were then fine-tuned on the ESC50 dataset.
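To illustrate what knowledge distillation means here, the following sketch implements the generic soft-target formulation for a single example: a hard-label cross-entropy term is mixed with the KL divergence between temperature-softened teacher and student distributions. This is an assumption-laden illustration, not the loss actually used to train the MobileNets; the function names, the temperature of 2.0, the mixing weight `alpha` and the example logits are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Mix hard-label cross-entropy with a soft-target KL term (generic form)."""
    # Cross-entropy of the student against the ground-truth label.
    hard = -math.log(softmax(student_logits)[label])
    # KL(teacher || student) on temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student))
    # The temperature**2 factor keeps the soft term's gradient scale comparable.
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft


# Hypothetical logits for a 3-class toy problem.
teacher = [2.0, 0.0, -1.0]
soft_only_match = distillation_loss([2.0, 0.0, -1.0], teacher, label=0, alpha=0.0)
soft_only_mismatch = distillation_loss([0.0, 2.0, -1.0], teacher, label=0, alpha=0.0)
print(soft_only_match, soft_only_mismatch)
```

With `alpha=0.0` only the soft-target term remains, so a student that reproduces the teacher's logits incurs zero loss, while any disagreement yields a positive value.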

Audio samples

Below, a selection of audio samples from the ESC50 dataset is presented, accompanied by their corresponding reconstructions generated by the different autoencoder models.

Autoencoder pretrained on ESC50

The following audio samples stem from pre-trained audio autoencoders that were then fine-tuned on the classification gradients of the different classifiers.

Each sample is provided in four versions: Original, MN Autoencoder, DyMN Autoencoder, and PaSST Autoencoder.

  • Index 8 - crow
  • Index 12 - clapping
  • Index 31 - water drops
  • Index 66 - crackling fire
  • Index 86 - insects
  • Index 102 - pig

Autoencoder trained from scratch

The following samples stem from autoencoders that were randomly initialized and then fine-tuned on the classification gradients of the corresponding classifiers.

Each sample is provided in four versions: Original, MN Autoencoder, DyMN Autoencoder, and PaSST Autoencoder.

  • Index 8 - crow
  • Index 12 - clapping
  • Index 31 - water drops
  • Index 66 - crackling fire
  • Index 86 - insects
  • Index 102 - pig